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Abstract 

Multipoint polynomial evaluation and interpolation are fundamental for modern symbolic and nu¬ 
merical computing. The known algorithms solve both problems over any field of constants in nearly 
linear arithmetic time, but the cost grows to quadratic for numerical solution. We fix this discrepancy: 
our new numerical algorithms run in nearly linear arithmetic time. At first we restate our goals as the 
multiplication of an n x n Vandermonde matrix by a vector and the solution of a Vandermonde linear 
system of n equations. Then we transform the matrix into a Cauchy structured matrix with some special 
features. By exploiting them, we approximate the matrix by a generalized hierarchically semiseparable 
matrix, which is a structured matrix of a different class. Finally we accelerate our solution to the original 
problems by applying Fast Multipole Method to the latter matrix. Our resulting numerical algorithms 
run in nearly optimal arithmetic time when they perform the above fundamental computations with 
polynomials, Vandermonde matrices, transposed Vandermonde matrices, and a large class of Cauchy 
and Cauchy-like matrices. Some of our techniques may be of independent interest. 

Key words: Polynomial evaluation; Rational evaluation; Interpolation; Vandermonde matrices; Transfor¬ 
mation of matrix structures; Cauchy matrices; Fast Multipole Method; HSS matrices; Matrix compression 
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1 Introduction 

1.1 The background and our progress 

Multipoint polynomial evaluation and interpolation are fundamental for modern symbolic and numerical 
computing. The known FFT-based algorithms run in nearly linear arithmetic time, but need quadratic time 
if the precision of computing is restricted, e.g., to the IEEE standard double precision (cf. [BFOO) . [BEGO08]). 
Our algorithms solve the problems in nearly linear arithmetic time even under such a restriction. 

At first we restate the original tasks as the problems of multiplication of a Vandermonde matrix by a 
vector and the solution of a nonsingular Vandermonde linear system of equations, then transform the input 
matrix into a matrix with the structure of Cauchy type, and finally apply the numerically stable FMM to 
a generalized HSS matrix that approximates the latter matrix^ “Historically HSS representation is just a 
special case of the representations commonly exploited in the FMM literature” |CDG06| . We refer the reader 
to the books |B10| . [VVMj . |EGH13| . and the bibliography therein for the FMM and the HSS matrices. 

Our resulting fast algorithms apply to the following computational problems: 

*Some results of this paper have been presented at ILAS’2013, Providence, RI, 2013, at CASC’2013, Berlin, Germany, 2013, 
and at CSR’2014, Moscow, Russia, 2014. 

^ “HSS” and “FMM” are the acronyms for “Hierarchically Semiseparable” and “Fast Multipole Method”. 
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• multipoint polynomial evaluation and interpolation, 

• multiplication by a vector of a Vandermonde matrix, its transpose, and, more generally, matrices with 
the structures of Cauchy or Vandermonde type, 

• the solution of a linear system of equations with these coefficient matrices, 

• rational interpolation and multipoint evaluation associated with Cauchy matrix computations. 

Some of our techniques can be of independent interest (cf. their extension in [PIG]). 

As in the papers [MRT05] , [CCS07| , [XXC12] , and [XXCB14] , we count arithmetic operations in the field 
C of complex numbers with no rounding errors, but our algorithms are essentially reduced to application of 
the celebrated algorithms of FFT and FMM, having stable numerical performance. 


1.2 Related works and our techniques 

Our progress can be viewed as a new demonstration of the power of combining the transformation of matrix 
structures of \P90f with the FMM/HSS techniques. 

The paper [P90) has proposed some efficient techniques for the transformation of the four most popular 
matrix structures of Toeplitz, Hankel, Cauchy, and Vandermonde types into each other and then showed 
that these techniques enable us to readily extend any efficient algorithm for the inversion of a matrix having 
one of these structures to efficient inversion of the matrices having structures of the three other types. The 
papers |PSLT93] and |PZHY97] have extended these techniques to the acceleration of multipoint polynomial 
evaluation, but have not invoked the FMM and achieved only limited progress. Short Section 9.2 of |PRT92j 
has pointed out some potential benefits of combining FMM with the algorithm of the paper [G88| . but has 
not developed that idea. The papers |P95| and |DGR96) applied FMM and some other advanced techniques 
in order to accelerate approximate polynomial evaluation at a set of real points. 

The closest neighbors of our present study are the papers [MRT05| . |CGS07| . |XXG12] . |XXCB14] . 
and mB- The former four papers approximate the solution of Toeplitz, Hankel, Toeplitz-like, and Han- 
kel-like linear systems of equations in nearly linear arithmetic time, versus the cubic time of the classical 
numerical algorithms and the previous record quadratic time of |GK095j . All five papers |GK095j , [MRTOSj , 
[CGS07| . |XXG12j . and |XXCB14] begin with the transformation of an input matrix into a Cauchy-like one, 
by specializing the cited technique of [P90| . Then |GK095| continued by exploiting the invariance of the 
Cauchy structure in row interchange, while the other four papers apply the numerically stable FMM in order 
to operate efficiently with HSS approximations of the basic Cauchy matrix. 

We incorporate the powerful FMM/HSS techniques, but extend them nontrivially. The papers [GK095) . 
|MRT05| . |CGS07| . |XXG12| . and |XXCB14| handle just the special Cauchy matrix C = 
which m = n, {sqj • ■ • > Sn-i} is the set of the n-th roots of unity and {ffi,.. ., tn-i} is the set of the other 
2n-th roots of unity. Our fast Vandermonde multipliers and solvers bring us to a subclass of Cauchy matrices 
C' = ( g-it- rather than to a single matrix: we still assume that the knots to,. .. ,t„_i are equally 

spaced on the unit circle, but impose no restriction on the knots sq, ..., Sm-i and arrive at the matrices 


Cs,/ = (-, 

\s^ - fuji / i,j=0 


m—l,n—1 


( 1 . 1 ) 


for any complex numbers /, sq, ■ ■ ■; Sm-i and uj = exp(27r-\/—1/n) denoting a primitive nth root of unity. 

We call the matrices Cg./ CV matrices, link them to Vandermonde matrices, and devise efficient approxi¬ 
mation algorithms that multiply a CV matrix by a vector, solve a nonsingular CV linear system of equations, 
and hence perform multipoint polynomial evaluation and interpolation. In order to achieve this progress, we 
work with extended HSS matrices, associated with CV matrices via a proper partition of the complex plane: 
we bound the numerical rank of the off-block-tridiagonal blocks (rather than the off-block-diagonal blocks, 
as is customary) and allow distinct rectangular blocks to share row indices. Extension of the FMM/HSS 
techniques to such matrix classes was not straightforward and required additional care. 

The paper |P15| revisited the method of the transformation of matrix structures (traced back to |P90) 1. 
recalled its techniques in some details, extended them, and finally outlined our present approach to polyno¬ 
mial interpolation and multipoint evaluation in order to demonstrate the power of that method once again. 
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The paper included only one half of a page to HSS matrices and about as much to the reduction of the 
polynomial evaluation and interpolation to computations with CV matrices. No room has been left for the 
description of nontrivial computations with generalized HSS matrices (having cyclic block tridiagonal part), 
to which the original problems are reduced. Furthermore the competing fast algorithms for polynomial and 
rational interpolation and multipoint evaluation of |MB72] . |H72] . and |GGS87] have not been cited. 

We fill this void by describing in some detail the omitted algorithms for generalized HSS computation, by 
linking polynomial and rational interpolation and multipoint evaluation to CV matrices, by demonstrating 
the inherent numerical instability of the algorithms of [MB72j , [H72j , and |GGS87) , and by presenting some 
numerical tests, in particular for comparison of numerical stability of our algorithms with that of |MB72) . 
Also we more fully and more clearly cover the approximation of CV matrices by generalized HSS matrices. 


1.3 Organization of our paper 

In the next section we recall some basic results for matrix computations. In Section|^we recall the problems of 
polynomial and rational evaluation and interpolation and represent them in terms of Vandermonde, Cauchy, 
and CV matrices. Sections and (on the Background) make up Part I of our paper. 

Sections and (on the Extended HSS Matrices) make up Part H, where at first we recall the known 
algorithms for fundamental computations with HSS matrices and then extend the algorithms to generalized 
HSS matrices having cyclic block tridiagonal part. Part H can be read independently of Section 

Sections an d [t] (on Computations with the CV Matrices and Extensions) make up Part HI of the 
paper. In Section iGwe approximate a CV matrix by generalized HSS matrices and estimate the complexity 
of the resulting numerical computations with CV matrices. In Section we comment on the extensions and 
implementation of our algorithms, in particular the extension to computations with Vandermonde matrices 
and polynomials. The results of Sectionj^imply our main results because we have already reduced polynomial 
interpolation and multipoint evaluation to computations with CV matrices in Part I and have elaborated 
upon fast computations with generalized HSS matrices in Part H. 

and equations (3.21 and (3.4) of Part I (which support the cited reduction to 


Part HI uses Section 3.2 


CV matrices) and Theorem 5.1 and Corollary 5.1 of Part H (where we estimate the cost of computations 
with generalized HSS matrices), but otherwise can be read independently of Parts I and H. 

Sections!^ andmake up Part IV of the paper. In Section]^ we report the results of our numerical tests. 
In Section [9|we briefly summarize our study. 



2.1 Some basic definitions for matrix computations 

O = Om,n is the m X n matrix filled with zeros. / = /„ is the n x n identity matrix. 

M'^ is the transpose of a matrix M, is its Hermitian transpose. 

diag(Ho,..., Hfc-i) = diag(Hj)*“g is a k x k block diagonal matrix with diagonal blocks Hg,..., Bk-i- 
Both {Bq ... Bk-i) and {Bq | ... | B^-i) denote a 1 x fc block matrix with k blocks Bq, ..., Bk-i- 
||M|| = ||M ||2 denotes the spectral norm of a matrix M. 

For an m X n matrix M = (mij)™Vg’"~^, write \M\ = max^j and so ||M|| < y/rrm \M\, but for a 

set S we write |5| to denote its cardinality. 

An m X n matrix U is unitary if U^U = In or UU^ = Im, and then ||t7|| = I. 

“<C” stands for “much less” quantified in context. 

2.2 Submatrices, rank, and generators 

An m X n matrix M has a nonunique generating pair (F, G^) of a length p ii M = for two matrices F 
of size m X p and G of size n x p. The minimum length of a generating pair of a matrix is equal to its rank. 
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Ti{B) and C{B) are the index sets of the rows and columns of its submatrix B, respectively. For two sets 
X C {1,..., m} and C {1,..., n}, define the submatrix B = such that 'R.{B) = X 

andC(B) = J. Write M(X,.) = M{I,J) \i J = {l,...,n}. Write = M{i,J) if X = {1,..., m}. 

Theorem 2.1. A matrix M has rank at least p if and only if it has a nonsingular px. p submatrix M{X, J). 
//rank(M) = p, then M = M{.,X)M{X,J)-^M{J, .). 

The theorem defines two generating pairs {M{.,X), MiX, J)~^ M{J,.) and {M{.,X)M{X, J)~^ ,M{J,.) 
and a generating triple {M{.,X), M{X,J)~^, M{J,.)) of a length p for a matrix M. We call such pairs 
and triples generators. One can obtain some generators of the minimum length for a given matrix by 
computing its SVD UYiV or its less costly rank revealing factorizations such as ULV and URV factorizations 
in [CGS07] , [XXG12] , and [XXGB14] , where the matrices U and V are unitary, S is diagonal, and L and R are 
triangular (cf. |GL13[ Section 5.6.8]). For efficient alternative techniques, some of which use randomization 
or heuristics, see |G()S08| . |GT01| . |HMT1H . |LWMRT| . [Mil], [Ml^ . |PQY15| , [TOO], [WTl] . |XXG12| . 
and the references therein. 

2.3 Small-norm approximation and perturbation 

Hereafter we deal with perturbations within a positive tolerance f. (One may think of machine epsilon, but 
in this paper we just assume that f is small in context.) 

A matrix M is a approximation of a matrix M if ||M — M|| < ^||M||. 

A ^-generator of a matrix M is a generator of its ^-approximation. 

The f-rank of a matrix M is the integer iniii||jQ^_j;^||<;j||jy^|| rank(M). 

A matrix M is ill-conditioned if its rank exceeds its numerical rank. 


3 Polynomial and rational evaluation and interpolation as opera¬ 
tions with structured matrices 

3.1 Four classes of structured matrices. Cauchy and Vandermonde matrices 

Recall the four classes of highly popular structured matrices, that is, Toeplitz matrices T = (tz-j)™^o" 
Hankel matrices H = Vandermonde matrices V = Vs = and Cauchy matrices 

( \ 771—l,n—1 

—A— ) . (Some authors call the transpose V'^ a Vandermonde matrix.) The mn entries 

^ ^ 0 

of such a structured m x n matrix are defined by at most m -\- n parameters. 

These classes have been extended to the four more general classes of matrices having structures of Toeplitz, 
Hankel, Vandermonde, and Cauchy types. Each such an mxn matrix is naturally defined by its displacement 
generator FH where F and G are m x d and d x n matrices, respectively, and where d min{m,n}, that 
is, min{TO, n} exceeds greatly the integer d (cf. m EE]). 

We mostly work with Vandermonde and Cauchy matrices and next recall some of their basic properties. 
The scalars sq, • ■ •, Sm-i, to, • • ■, tn-i define the Vandermonde and Cauchy matrices 14 and Cg.t, and we 
call them knots. If we shift the knots of a Cauchy matrix or scale them by a constant, we arrive at a Cauchy 
matrix again: aCas,at = Cs,t for a 7 ^ 0 and Cg+aea-i-ae = for e = (1 ,..., 1)"^. 

Theorem 3.1. (i) An mxn Vandermonde matrix 14 = (Si )7j4g"~^ rank if and only if all m knots 

( \ m—l,n—1 

1 is well defined and has full 

rank if and only if all its m -\- n knots are distinct. 

The four cited matrix structures have quite distinct features. In particular the matrix structure of Cauchy 
type is invariant in row and column interchange, in contrast to the structures of Toeplitz and Hankel types. 
This structure is stable in shifting and scaling its basic knots unlike the structure of Vandermonde type. 

The paper |P90| , however, has transformed the matrices of any of the four classes into the matrices of the 
three other classes simply by means of multiplication by Hankel, Vandermonde, and transposed or inverse 
Vandermonde matrices. Then the paper has showed that such transforms readily extend any efficient matrix 


4 






































inversion algorithm for matrices of one of the four classes to the matrices of the three other classes, and 
similarly for the computation of determinants and the solution of linear systems of equations. 

Presently we apply a simple specialization of this general technique for devising efficient approximation 
algorithms for Vandermonde matrix computations linked to polynomial evaluation and interpolation. 


3.2 Four computational problems 


Problem 1. Multipoint Polynomial evaluation or Vandermonde-by-vector multiplication. 

INPUT: m + n complex scalars po,... ,Pn-i', sq, ..., Sm-i- 

OUTPUT: m complex scalars Vq, ..., Vm-i satisfying Vi = p{si) for p{x) = Po + PiX + • • • + Pn-ix'^~^ and 
i = 0,..., m - 1 or equivalently Up = v for U = 14 = (s:’)™” p = {pjTjZl-, and v = 

Problem 2. Polynomial interpolation or the solution of a Vandermonde linear system. 
INPUT: 2n complex scalars vq, , u„_i; sq, ■ • ■; Sn-i, the last n of them distinct. 

OUTPUT: n complex scalars po, ■ ■ ■ ,Pn-i satisfying the above equations for m = n. 


Problem 3. Multipoint rational evaluation or Cauchy-by-vector multiplication. 

INPUT: 2m + n complex scalars Sqj • ■ •; Sm-iUo; • ■ ■ j ^n-i; ^o,.. •, Vm-i- 

OUTPUT: m complex scalars vq,. .., Vm-i satisfying Vi = s^-t i = 0, ..., m — 1 or equivalently 

m —l,n—1 


( \ ^ 


u = and v = (u,)™o^- 

Problem 4. Rational interpolation or the solution of a Cauchy linear system of equations. 

INPUT: 3n complex scalars Sq i i s„_i ; to, • • ■, tn- 1; '^oi i ^n-1 1 the first 2 n of them distinct. 

OUTPUT: n complex scalars uq, ..., it„_i satisfying the above equations for m = n. 


J=o 


3.3 The arithmetic complexity of Problems 1—4 

The algorithm of [MB72j solves Problem 1 by using 0((m + n) log^(n) log(log(n))) arithmetic operations. 
This complexity bound has been extended to the solution of Problems 2 in |H72] , 3 in [GGS87| , and 4 (see 
equation (3.11 below) and is within a factor of log(n) log(log(n)) from the optimum |BM75] . 

The cited algorithms supporting this bound require extended precision of computing and fail already 
for the input polynomials of moderate degree if the precision is restricted to the IEEE standard double 


precision (cf. Table 8.8). The approach relies heavily on computing with extended precision. Already the 


fast polynomial division algorithm requires computations with high precision for the worst case input, and 
the problem is aggravated in the recursive fan-in processes of polynomial multiplication and division in the 
algorithms of |MB72] . [H72] . and |GGS87| . Moreover, the following argument demonstrates that we must 
add at least n bits of precision when these algorithms compute the Lagrange auxiliary polynomial with the 
roots So,..., s„_i. 


Problem 5. Computation of the polynomial coefficients from its roots. 

INPUT: n complex scalars Sq, ..., s„_i. 

OUTPUT: the coefficients of the polynomial l{x) = n”=ro^(2^ ~ ^i)- 

In order to observe the need for the precision increase, notice that the constant coefficient has absolute 
value nj=o I'®*!’ which turns into 2” if, say, Si = 2 for all i, but the coefficient of has the order of 2" 

even if Si = 1 for all i. The restriction of using bounded (e.g., double) precision of computing rules out using 
the cited fast algorithms, and the known double precision algorithms for Problems 1-4 require quadratic 
arithmetic time (cf. |BF00| . [BEGOO^ l. 

This pessimistic outcome, however, does not apply to the important special case where the knots Si 
are the nth roots of 1, that is, where Si = w* for a; = = exp(27r-\/—1/n), i = 0,... ,n — 1. In this 

case, Us = and Problems 1 (for m = n) and 2 turn into the computation of the forward and 

inverse discrete Fourier transforms, respectively. Hereafter we use the acronyms DFT and IDFT and write 
U = :^(‘j-'*'^)i)7=o- Notice that U = and unitary matrices. Based on 

EFT, one can perform the DFT and IDFT, that is, can solve Problems 1 and 2 in this special case, by using 
bounded precision of computing and involving only 0(nlog(n)) arithmetic operations |P01[ Problem 2.4.2]. 
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3.4 Cauchy—Vandermonde links and their impact on Problems 1 and 2 

The following equation, traced to |K68| on |P01[ page 110], links Problems 1 and 2 to Cauchy matrices. 


Cs^t = diag(;(sd ^diag(Z'(tj))”^o\ l{x) = Y[{x - tj). 

j=o 


(3.1) 


For t = / • ^ 0, the knots tj are the scaled nth roots of 1, l{x) = x" — /", l'{x) = 


\n—1 

Ji=0 

\n-l T/-1 


Vt = y/n ndiag(/l)"rQ , diag(/“l)"ro 11^. Likewise for s = e • (w*)”ro^, e ^ 0, the knots Si are 

the scaled nth roots of 1, Vg = y/n fl diag(e*)"TQ^ and 14 “^ = ^ diag(e“l)”rQ 

Write Cgj = 0 and Ce,t = for e ^ 0 and obtain from (3.1) that 


K = 




( \ Ta— 1 

s'l - /") C's,/diag(w^)J‘rofldiag(/^)”ro, 




^ diag(e diag(l(e*))”Lo^C'e,t diag , and 

/ tn-l .ri-1 

= V«diag(/"^)”ron^ diag(a;"^)"roC'4) diag ( „ _ for m = n. 


(3.2) 

(3.3) 

(3.4) 


These expressions link Vandermonde matrices and their inverses to the m x n CV matrices Cgj of equation 

§ / \ n —l,m—1 

and the n x m CV^ matrices Ce,t = ~Ct,e = y eui^-t ) (^°^ ® ^ d), that is, Cauchy matrices 


with an arbitrary knot set T = {toi • ■ ■ j tn-i} and with the knot set S = {si = ew®, i = 0,... ,m — 1}. More 
details on the subjects of this section can be found in [Pb| . 


PART II: EXTENDED HSS MATRICES 


4 Quasiseparable and HSS matrices 


4.1 Quasiseparable matrices and generators 

Definition 4.1. Suppose that an m x n matrix M is represented as a k x k block matrix with a block 
diagonal E = (Eq, ..., Efc_i). Let x(E) denote the overall number of the entries of all its k diagonal blocks 
Eq, ..., Efc_i and let x(E) <§; mn, that is, let mn greatly exceed x(E). Furthermore let I and u denote the 
maximum ranks of the sub- and superdiagonal blocks of the matrix M, respectively. Then the matrix M is 
(Z,'u)-quasiseparable. By replacing ranks with ^-ranks we define a (^, 4 w)-quasiseparable matrix. 

The definition generalizes the class of banded matrices and their inverses: a matrix having a lower 
bandwidth I and an upper bandwidth u as well as its inverse (if defined) are (/, M)-quasiseparable. 

In order to operate with (Z, it)-quasiseparable matrices efficiently, one exploits their representation with 
quasiseparable generators, demonstrated by the following 4x4 example and defined below in general form. 


/ So 

SoTi 

S0B1T2 

SoBiB2T3\ 

PiQo 

El 

S1T2 

S1B2T3 

DjAiQo 

P2Q1 

E2 

S2T3 

\P3A2A1Q0 

P3A2Q1 

P3Q2 

E3 y 


By generalizing this example we arrive at the following definition. 
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Definition 4.2. (Cf. Table 4-1.) Suppose that an m x n matrix M is represented as a k x k block matrix 
with a block diagonal S = (Eg,..., Sfe_i) such that x(E) ^ mn. (We reuse these assumptions of Definition 


13) 

Furthermore suppose that a set {Ii ,... ,1^} partitions the set {1,..., m}; a set {Ji,..., J7fc} partitions the 
set {1,..., n}, and there exists a six-tuple {Pi, Qh, Sh, Ti, Ag, Bg{ such that M{Ii, Jh) = PiAi-i ■ • • Ah+iQh 
and M{Ih, Ji) = ShBh+i ■ ■ ■ Bi^iTi for 0 < h < i < k. 

Here Pi, Qh, and Ag are \Ii\ x U, Ih+i x \Jh\, and Ig+i x Ig matrices, respectively, and 
Sh, Ti and Bg are \Ih\ x Uh+i, Ui x \Ji\, and Ug x Ug+i matrices, respectively, 


for g = l,...,k-2, h = Q,...,k-2, i = l,...,k-l. 

Then the six-tuple {Pi, Qh, Sh, Ti, Ag, Bg} is an (t, M)-quasi-separable generator of the matrix M, and 
the integers I = maxgjtg} and u = 'ai&yih{uh} are the lower and upper lengths or orders of this generator. 


Table 4.1: The sizes of quasiseparable generators of Definition 4.2 


Pr 

Qh 

^9 

Sh 

T. 

Bg 

\Ii\ X k 

Ih-hl X \Jh\ 

Ig+l ^ 

\^h 1 ^ 

Ill X \Ji\ 

Ug X Ug+l 


Theorem 4.1. (Cf. \B10f . \ VVMf . \X12f . lEGHlSff . and the bibliography therein.) A matrix M is {l,u)- 
quasi-separable if and only if it has a (nonunique) representation via [I, u)-quasi-separable generators. 


By virtue of this theorem one can redefine the (Z, u)-quasiseparable matrices as those representable with 
the families of quasiseparable generators {Ph, Qi, Ag} and {Sh, Ti, Bg} that have lower and upper orders I 
and u, respectively. Definitions |4.1| and |4.2| provide two useful insights into the properties of these matrices. 


The third equivalent definition in Section [4~^ (cf. T heorem 4.5) provides yet another insight and is linked to 
the study of the Cauchy matrix in |CGS07] . |XXG12| . |XXCBl4] . Various definitions, equivalent or 

closely related to those above, have been introduced by a number of authors (cf. |VVMj . |B10j . |EGH13| . and 
the references therein). In particular the related study of 7J-matrices and iJ^-matrices in |H99| . [TOOj . |BH02) . 
|GHn3| . [B09], |BTn] . and references therein was the basis for the software libraries HLib, www.hlib.org, and 
H2Lib, http://www.h21ib.org/, https://github.com/H2Lib/H2Lib, developed at the Max Planck Institute 
for Mathematics in the Sciences. 


4.2 Operations with quasiseparable matrices: definitions and demonstration 

Next we cover some basic operations with matrices represented with (Z, u)-quasiseparable generators. 

Definition 4.3. Given diagonal blocks E^, <7 = 0,..., fc — 1, of an (I, u)-quasiseparable matrix M and (Z, u)- 
quasiseparable generators for all its sub- and super-diagonal blocks, let a{M) and /3{M) denote the arithmetic 
cost of computing the vectors Mu and M~^u, respectively, maximized over all normalized vectors u, |u| = 1, 
and minimized over all algorithms. Write fi{M) = 00 if the matrix M is singular. a^(M) and /3j(M) 
replace the bounds a{M) and (3{M), respectively, provided that instead of the evaluation of the vectors Mu 
and M~^u, respectively, we approximate them within the error bounds ^||Mu|| and ^||M“^u||, respectively. 

The straightforward algorithm supports the following bound. 

Theorem 4.2. a{M) < 2(m -\- n)p — p — m where a generating pair of length p defines an mx n matrix M. 

The following estimates for computations with quasiseparable matrices extend the well-known estimates 
in the case of banded matrices. 

Theorem 4.3. \DV98} . {H99} . \EGO0f . Suppose that an {I, u)-quasiseparable matrix M of size m x n is 
defined by its x Uq diagonal blocks E^, q = 0,... ,k — 1, such that ~ ^q=o ^9 “ 

s = + u){m -\- n)) and by the generators of length at most I and at most u for its sub- 

and superdiagonal blocks, respectively. 

(i) Then a{M) < 2 -I- u) -I- s) -I- 2Pk -\- 2u^k = 0{{l -\- u)(m -\- n)) and 

(ii) /3{M) = 0(^^_p((Z -|- u)^{l -I- u -b riq)nq -b n^)) if niq = Ug for all q and if the matrix M is nonsingular. 
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Example 4.1. (Cf. Figures 2 and 3.) Let us multiply by a vector v the matrix M of equation (4.1). 


(i) At first view it as 2x2 block matrix with diagonal blocks Ei = 
multiply the blocks 


and 


P2A1Q0 

P3A2A1QQ 


So 

PiQo 
P2Q1 
P 3 A 2 Q 1 


SoTi 

Si 


and So = 


S 2 

P3Q2 


S 2 T 3 

S3 


by two subvectors of the vector v. 
and E 2 of smaller sizes by 


'S0B1T2 S0B1B2T3 
S1T2 S1B2T3 

(ii) Then multiply the blocks SqTi, PiQo, S 2 T 3 , and P 3 Q 2 of the matrices Si 
four subvectors of the vector v. 

Perform the computations at both stages fast if the given generators of the blocks have small length. 

(Hi) Then multiply the four diagonal blocks Si, S 2 , S 3 , and S 4 by four subvectors of the vectors v. Perform 
these computations fast because the four blocks have a small overall number of entries. 

(iv) Finally obtain the vector Mv by properly summing the products. 


4.3 Fast multiplication with recursive merging of diagonal blocks: outline 

In Example |4.1| we multiply the matrix M by a vector by using generators for only 6 out of its 22 sub- and 
super-diagonal blocks. Next we extend the above demonstration to multiplication of a general quasiseparable 
matrix M by a vector by using a small fraction of all generators. 

Definition 4.4. Suppose that M = {Mq | ... | Mfc_i) is a lx k block matrix with k block columns Mq, each 
partitioned into a diagonal block S^ and a neutered block column Nq, q = 0,... ,k — 1 (cf. our Figures 1-3 
and IMRTOIH Section 1]). Such a matrix is p-neutered if its every neutered block column N is represented as 
N = FF[ or N = FSFl where F of size h x r, S of size r x r, and FI of size r x k are its generator matrices 
and r < p. Call such a pair or triple a length r generator of the neutered block N and call r its length. A 
f-approximation of such a matrix is called (^, p)-neutered. 

In Figure 1 the diagonal blocks are black and the neutered block columns are gray or white. 

FIGURE 1 



Figure 1: FIGURE 2 









In Figure 2 the diagonal blocks from Figure 1 (marked by black color) are merged pairwise into their 
diagonal unions, each made up of four blocks. Two of them (from Figure 1) are marked by black color, and 
the two other by gray color. The new neutered block columns are either white or gray, but their gray color is 
lighter. The new (larger) diagonal blocks of Figure 2 are merged pairwise into the diagonal blocks of Figure 
3, each made up of two black and two gray blocks, and its two neutered block columns are white. 

FIGURE 2 
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FIGURE 3 
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Theorem 4.4. Suppose that an m x n matrix M is a p-neutered k x k block matrix and that we are given 
k generators of length at most p for all its k neutered block columns as well as all the x(5j) entries in the k 
diagonal blocks Eq: ■ • ■ j Then 

a{M) < 2x(E) + (2m + 2n — l)kp = 0(x(E) + (m + n)kp). 


Proof. Multiply the diagonal blocks by vectors in the straightforward way and multiply the neutered block 
columns by vectors by using the representation with generators. 

Formally write M = M' + diag(Eq)^'“g. Notice that q;(M) < 2x(E) + a{M') + m. The neutered block 
columns of the matrix M share their entries with the matrix M\ whose other entries are zeros. So the k 
pairs {Fq, Gq), ..., {Fk-i,Gk-i) together form a single generating pair of a length at most kp for the matrix 
M'. Therefore a{M') < (2m + 2n — l)kp — m by virtue of Theorem 4.2 □ 


The upper bound on a{M) of Theorem 


4.4 


is sufficiently small unless the integers k or x(S) are large. 
Unfortunately we cannot bound both of these integers at once, but we can circumvent the problem by 
applying the algorithm of Theorem |4 . 4| recursively. We begin with a partition of the matrix M defined by a 
few diagonal blocks that are p-neutered matrices themselves. Then we multiply neutered block columns fast 
(by using their generators), partition the diagonal blocks into smaller diagonal blocks and neutered block 
columns, and apply the same techniques recursively until we decrease the overall number of entries of the 
remaining diagonal blocks below a fixed tolerance bound of order m -I- n or (m -I- n)p. 

We can begin with k = 2 and x(^) ~ 0.5n^ and then double the integer k and roughly halve the integer 
x(E) in every recursive step. Then overall we deal with only 0(m -I- n) neutered block columns and their 
generators and therefore multiply the matrix M by a vector by using 0((m -|- n)p) arithmetic operations in 
all these recursive steps, thus matching the cost bounds in part (i) of Theorem 4.3 


4.4 HSS and balanced HSS matrices and the cost of basic operations with them 

Let us supply formal dehnitions and formal derivation of the latter estimates by applying the recursive 
process in the opposite direction, where at first the integer k is large and then is recursively doubled, while 
the diagonal blocks are small at first and then are merged recursively pairwise. 

Definition 4.5. Fix two positive integers I and q such that I + q < k and then merge the I block columns 
Mq, Mqj^i ,..., Mqj^i_i, the I diagonal blocks E^, E,j_|_i,..., E,j_|_;_i, and the I neutered block columns Ng, fVg_|_i, 
... into their union Mqg = M(., Uj“QC(Eq+j)), their diagonal union F,qg, and their neutered union 

Nqg, respectively, such that TZ(Eq^i) = Uj“Q7?.(Eg+^) and every block column Mqg is partitioned into the 
diagonal union Hg i and the neutered union Ngj. 

Define recursive merging of all diagonal blocks Eg,..., Efc_i by a binary tree whose leaves are associated 
with these blocks and whose every internal vertex is the union of its two children (see Figure 4). For every 
vertex v dehne the sets L{v) and R{v) of its left and right descendants, respectively. If 0 <\L{v)\-\R{v)\<l 
for all vertices v, then the binary tree is balanced and identihes balanced merging of its leaves, in our case the 
diagonal blocks. We can uniquely define a balanced tree with n leaves by removing the 2^^"^ — n rightmost 
leaves of the complete binary tree that has 2^*^”^ leaves for l{n) = |"log 2 (n)]. All leaves of the resulting heap 
structure with n leaves lie in its two lowest levels. 

FIGURE 4: Balanced merging of diagonal blocks. 
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So,1,2,3,4,5.6,7 



Definition 4.6. (i) A block matrix is a balanced p-HSS matrix if it is p-neutered throughout the process of 
balanced merging of its diagonal blocks, that is, if all neutered unions of its neutered block columns involved 
into this process have ranks at most p. This is a p-HSS matrix if it is p-neutered throughout any process of 
recursive merging of its diagonal blocks. 

(ii) By replacing ranks with ^-ranks we define balanced (5,/ 3 )-HSS matrices and (^,p)-HSS matrices. 

Fact 4.1. (i) Let a matrix be pj-neutered at the j-th step of recursive balanced merging for every j. Then 
this is a balanced p-HSS matrix for p = maxj pj. 

(ii) Likewise, let a matrix be j, pj)-neutered at the j-th step of recursive balanced merging for every j. 
Then this is a balanced (f, p)-HSS matrix for f = maxj and p = max^ pj. 

Theorem 4.5. (i) Every (l,u)-guasiseparable matrix M is an (I + u)-HSS matrix. 

(ii) Every p-HSS matrix is {p, p)-quasiseparable. 

Proof. A neutered block column Nq can be partitioned into its block sub- and superdiagonal parts Lq and Uq, 
respectively, and so rank(A"q) < rank(Lq) -|- rank{Uq). This implies that rank(A"q) < l-\-u for q = 0,..., k — 1 
if the matrix M is (1, u)-quasiseparable, and part (i) is proven. 

Next consider the union N of any set of neutered block columns of a matrix M. It turns into a neutered 
block column at some stage of appropriate recursive merging. Therefore rank(A^) < p where M is a p-HSS 
matrix. Now, for every off-diagonal block H of a matrix M, define the set of its neutered block columns that 
share some column indices with the block B and then notice that the block H is a submatrix of the neutered 
union of this set. Therefore rank(H) < rank(iV) < p, and we obtain part (ii). □ 


By combining Theorems |4.3| and |4.5| we obtain the following results. 

Corollary 4.1. Assume a p-HSS matrix M given with mq x Uq diagonal blocks Eg, q = 0,..., k — 1, and 


write m = Y!(i=o mq, n = Y.q=o ^ = J2g=o m„n„ 

^k-l 


jq=0 


Then 


(i) a{M) < 2s -I- 4p^fc -|- 4 Y)qZo (mq -b nq)p = 0((m -j- n)p -b s) and 

(ii) I3{M) = 0(J)^g~g((p -b nq)p^nq -hn^)) if mq = Ug for all q and if det{M) ^ 0. 

For a balanced p-HSS matrix M we only have a little weaker representation than in Theorem |4.1[ and 


so the proof of the estimates of Corollary 4.1 for a{M) and /3{M) does not apply, but next we extend these 
bounds. Unlike Theorem 4.3 and Corollary |4.1[ we allow mq Uq for all q. 


Theorem 4.6. Assume a balanced p-HSS matrix M with mq xuq diagonal blocks 5 = 0,..., k — 1, having 
s = entries overall and write I = |"log 2 (A;)], m = = ^ 5=0 m+ = max^“gmq, 

n+ = maXg“g Uq, and s < min{m+n,mn+}. 


^ij=0 "-g; 

(i) Then 


a{M) < 2s -b (to -b 4 (to -b n)p)l. 

(ii) If m = n and if the matrix M is nonsingular, then 

j5{M) = Oiri+s + {n\ + pn+ -b lp^)n + ikp + n)p^). 


(4.2) 


(4.3) 


(Hi) The same bounds (4.2) and hold for the transpose of a balanced p-HSS matrix M matrix having 

Uq X mq diagonal blocks Eg for q = 0,... ,k — 1. 
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Proof. Let us readily prove part (i) by just counting the arithmetic operations involved in recursive merging. 

With no loss of generality assume that the {I — l)st (that is, hnal) stage of a balanced merging process 
has produced a 2 x 2 block representation 


M = 


^ ylO q{l)rp(P 

qil)rp(l) 


where is an rri'p x rSP matrix, 


] 

4'^ 


T, 


(0 


-(0 -(0 i ■ 

IS an n ^-' x py matrix, 


4.1) 


■n'ry' =n. Clearly a{M) <m + + J2j=o + “(‘S'oi 


44 'i 


Apply Theorem 
The second last 


4.2 


and obtain that X]i=o + '^)P- 


(1)^ 


< P, j 

yi) 

(0 


= 0 , 1 , 

h 0(544- 


44 _L^(b - 


and 


stage of the balanced merging process produces a similar 2 x 2 block representation for 


each of the diagonal blocks S,- , j = 0,1. Therefore J2j=o 


^4 • • ■, are the diagonal blocks output at the second last merging stage (cf. Figures 3 and 4). 


di-i) 


0(114) < m + A{m + n)p + J 2 'i'^o *^(^4 ^4 '"^tiere 


By recursively going back through the merging process, obtain that o(M) < (m + 4(m + n)p)l + 


HereS, 

e:io'«(s,)< 2 e:- 


q 

k-l 

q- 


= 


is an niq x Uq matrix for rriq = 


- (0) 


- ^(0) 


g = 0, 


,k — 1. Hence 


I 9 = 2s, implying (4.2). 

Part (ii) of the theorem has been supported by the merging and compression algorithm of |CGS07| . The 
algorithm has been presented and analyzed in |CGS07| (cf. also [XXG12| and [XXCBl^ l for the subclass 
of balanced p-HSS matrices, approximating the special matrix ~ exp(27r-\/—1/n) and 

/ = exp(7r-\/—1/n), denoting primitive nth and 2nth rooots of 1, respectively, but both the algorithm and 
its analysis are readily extended, and bound (4.3) follows. All the proofs can be equally applied when rows 
of the matrix M replace its columns and vice versa, and this implies part (iii). □ 


Corollary 4.2. Under the assumptions of parts (i)-(iii) of Theorem \ f.(^ suppose that kp = 0{n) and 
n+4 p = 0{\og{n)). Then a{M) = 0{{m + n)\oy {n)) and P{M) = 0{n\og^{n)). 

For our application to computations with GV matrices we must estimate a{M) and f3{M) for a little 
more general class of matrices M dehned in the next section. (Such a matrix has cyclic block tridiagonal 
part with a sufficiently small overall number of entries, say, 0((m + n) log(TO + n)), such that all blocks of 
the matrix M not overlapping this part have small rank, say, 0(log(m + n)).) The algorithms supporting 
Theorem |4.6| and Gorollary |4.2| are quite readily extended to these matrices in the next section. 


5 Extension from diagonal to tridiagonal blocks 


Example 5.1. The following matrix has eight square or rectangular diagonal blocks Eq, ..., E 7 and becomes 
block tridiagonal if we glue its lower and upper boundaries, 


M = 


/So 

Ai 

O 

O 

o 

o 

o 


\Bj 


Bo 

£1 

A 2 

O 

O 

o 

o 

o 


o 

Bi 

£2 

A 3 

o 

o 

o 

o 


000 
000 
B2 o o 

£3 Bo O 
A 4 E 4 
O A 5 E 5 

O O Ao 
000 


O Ao\ 
O O 
O O 
O O 
O O 
Bo O 

Eg Bq 
Ay Ey/ 


Define the eight tridiagonal blocks, 


2 jq — 






(5.1) 


^4 — 
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Here ^ 5 °^ andT,'^^ are six blocks of the matrix M of (5.1), while Sg'^^ andJi^f^ consist 

of two pairs of its blocks. Each pair, however, turns into a single block if we glue together the lower and upper 
boundaries of the matrix M. With the diagonal block Eg 
block column Mq such that C{Mq) = ClT^'q^). 


(c J 

and the tridiagonal block Eg we still associate a 


(c) 

The admissible block Nq , playing the role similar to that of a neutered block column of Definition 4-4 


(c) (c) 

complements the tridiagonal block Eq ' in its block column. The block Ny ^ is filled with zeros in the case of 
the matrix M of (5.1) for every q, q = 0,... ,7, but not so in the case of general 8x8 block matrix embedding 
the matrix M of 


Here are some sample unions of the tridiagonal blocks of the matrix M of (5.1), E, 


(c) 

0 , 1 :- 


= M, 



( B'j 

0 

0 









So 

Bo 

0 

0 


/ B 7 

0^ 


(B, 

o\ 

^ 0 , 1 , 2 , 3 ~ 

Ai 

0 

El 

A2 

Bi 

E2 

0 

B 2 

y(c) _ 

1 ^ 0,1 ~ 

So 

Ai 

Bo 

Si 

, and S^^^ = 

S2 

As 

B 2 

S3 


0 

0 

^3 

S3 


^0 



[0 

A4 J 


\o 

0 

0 





In Figure 5 the admissible blocks are light gray or white; two adjacent blocks of each black diagonal 
block are darker gray; the triples of these black and gray blocks form the tridiagonal blocks. The neutered 
block columns are either white or gray. 

FIGURE 5 



Let us generalize this demonstration (see Figure 5). Assume a block matrix M with k diagonal blocks 
Eq, of sizes mq x Uq, ioi q = 0,..., k — 1, and glue together its lower and upper block boundaries. Then 
each diagonal block, including the two extremal blocks Eq and Efc_i, has exactly two adjacent blocks in its 
block column: they are given by the pair of the subdiagonal and superdiagonal blocks. Define the tridiagonal 

(c) (c) (c) (c) 

blocks Eg ,..., E^_j^ of sizes mq xuqhy combining such triples of blocks where mq = mg_i mod k + rriq + 
'm-q+i mod fe, d = 0,..., fc — 1. Write m^'^) = notice that = 3m because the number of 

rows in each of the three block diagonals sums to m. Therefore 3*-°^ = ^ m^‘^)n+ < 3mn+. 
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The complements of the tridiagonal blocks in their block columns are also blocks, called admissible (cf. 
[BlOj b We call the matrix itself an extended HSS matrix, and we extend accordingly our definitions of the 
unions of blocks, recursive and balanced merging, p-neutered, balanced p-HSS, p-HSS matrices, as well as 
(^, p)-neutered, balanced (^,p)-HSS, and (^,p)-HSS matrices (cf. Definitions 4.4 4.5 and 4.6). Can we 


extend Theorem 4^ and Corollary |4.2| to the case of extended balanced p-HSS matrices M where we replace 
the integer parameters m and s by = 3 to and 3*-°^ < = 3mn^, respectively? The extension of 

part (i) of Theorem 4.6 is immediate, but in order to extend the algorithms supporting its part (ii), we must 
impose some restriction on the input matrix M. 


Definition 5.1. An extended balanced p-HSS matrix is hierarchically regular if all its diagonal blocks at 
the second factorization stage of the associated balanced merging process have full rank. This matrix is 
hierarchically well-conditioned if these blocks are also well-conditioned. 


Theorem 5.1. Suppose that the matrix M in Theorem 4-6 is replaced by an ext ende d mxn balance d p-H SS 
matrix and also suppose that the integer parameters m and s in bounds (4^) on a{M) and (4-S) on 

I3{M) are replaced by m^'4 = 3m and < 3mn_|_, respectively. Then bound (4-4) still holds, and bound 
holds if m = n and if the matrix M is hierarchically regular and hierarchically well-conditioned. 

Proof. Revisit the proof of the Theorem 4.6 by replacing the integer parameters m and according to 


the assumptions of Theorem 5.1 and verify that the proof still remains valid (use the assumption that the 
matrix M is hierarchically regular and hierarchically well-conditioned in order to extend bound (4.3)). □ 


Corollary 5.1. Under the assumptions of Theorem \5.1\ suppose that kp = 0{n) and n_|_ -I- p = 0(log(n)). 
Then a{M) = 0{{m-\-n)log^{n)) and I3{M) = 0{n log (n)). 


PART III: COMPUTATIONS WITH CV MATRICES 

AND EXTENSIONS 


6 Approximation of the CV and CV^ matrices by HSS matrices 
and algorithmic implications 


Our next goal is approximation of CV by HSS matrices, which will imply fast approximate solution of 
Problems 1-4 because in Part I we reduced them to computations with CV matrices of (1.1), and in Part 11 
we described fast computations with HSS matrices. 


6.1 Small-rank approximation of certain Cauchy matrices 

Definition 6.1. (See \CGS07[ page 1254].) For a separation bound 9 <1 and a complex separation center 
c, a pair of complex points s and t is {9, c)-separated if \ \ If 9. A pair of sets of complex numbers S and 

T is (0, c)-separated if every pair of points s G S and t gT is {9, c)-separated. 

Lemma 6.1. (See ]R83] and ICGS07\ equation (2.8)] or IPb'f .) Suppose a pair of complex points s and t is 
{9, c)-separated for 0 < d < 1 and a complex center c. Fix a positive integer p and write q = and jgl < 9. 

Then ^ ECo ^ for \qp\ = ^ and a positive integer p. 

Corollary 6.1. (Gf. IGGSOl^ Section 2.2], \Bl(f] . or \Pb^ .) Suppose that two sets of 2n distinct complex 
numbers S = {sq, • • ■: Sm-i} ond T = {to,..., tn-i} ore {9, c)-separated from one another for 0 < 0 < 1 and 
a global complex center c. Define the Gauchy matrix C = ( ^.(_^. ^ |si — c| 

denote the distance from the center c to the set S. Fix a positive integer p and define the m x p matrix 
F = (l/(si — cY^^Yi'(f=o‘~^ nx p matrix G = {{tj — (We can compute these matrices 

by using (m -\- n)p -\- m arithmetic operations.) Then 

np 

C = FG^+ E,\E\<j^^^^. ( 6 . 1 ) 
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6.2 Block partition of a Cauchy matrix 

Generally neither CV matrix of equation (O) nor its blocks of a large size have global separation centers. 
So, instead of the approximation of a CV matrix by a low-rank matrix, we seek its approximation by an 
extended balanced p-HSS matrix for a bounded integer p. At first we fix a reasonably large integer k and 
then partition the complex plane into k congruent sectors sharing the origin 0. The following definition 
induces a uniform k-partition of the knot sets S and T and thus induces a block partition of the associated 
Cauchy matrix. In the next subsection we specialize these partitions to the case of a CV matrix. 

Definition 6.2. (See Figure 6.) A{(l),4>') = {z = exp{ipy/—l) : Q<(j)<ip<cj)'< 2 tt} is the semi-open 
arc of the unit circle {z : \z\ = 1} with length (j)' — (j) and endpoints t = exp(^-\/—1) and t' = exp(^'-\/— 1 ). 

= {z = rexp{'tljy/^) : r>0, 0<(j)<ijj<(j)'< 2 tt} is the semi-open sector. f{(j),(j)') is its 

exterior. 

In Figure 6 we mark by black color an arc of the unit circle {z : |z = 1|}. The five line intervals 
[0,r], [0,c], [0, r'], [r, c], and [c, r] are shown by dotted lines. Two broken lines represent the two line 
intervals bounding the intersection of the sector r('!/;,'0') and the unit disc D(0,1) = {z : \z\ < 1}. The two 
perpendiculars from the center c onto these two bounding line intervals are also represented by broken lines. 

FIGURE 6 



Fix apositive integer write k = 2 *+, = 2q'Kjk, and = (fq+i mod k- Then \(j)'^—(l)g\ = 27r/k for all q. 

Partition the unit circle {z : \z = I|} by k equally spaced points ipo,... jfik-i into k semi-open arcs 
Aq = A{(j)q, (j)'q), cach of length 27r/fc. Define the semi-open sectors T^ = T{(j)q, (j)'g) for 9 = 0,..., fc — I, that 

is, r, = r{(j)q,(j)q+i), for 9 = 0 ,... ,A: - 2 , and T^-.i = r( 0 fc_i, 0 o)- 

Assume the polar representation Si = |si| exp{p.i\/ — 1 ) and tj = \tj\ exp{iyj\/—l). 

Notice that the knots to, ■ ■ ■ An-i have been enumerated in the counter-clockwise order of the angles 
icj, beginning with the knots in the sector r((/)o, 4>o)- Similarly re-enumerate the knots sq, ■ ■ ■, Sm-i, in the 
counter-clockwise order of the angles pj. Induce the block partition of a Cauchy matrix C = and 

its partition into block columns C = (Cg | ... | Ck-i) such that 


c — 

'-'P.9 ~ 


(si — C- ^ 




and Cq = 


(s,- — ti ) 


j ' SjG{0,...,Tl — 


for p,q = 0, 


, fc — I. 
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Furthermore, for every q, define (i) the diagonal block = Cq^g, (ii) the two adjacent blocks Cq-i mod k,q 
and Cq+i mod k,q above and below it, (iii) the tridiagonal block (made up of the block Cq and the two 
adjacent blocks), and (iv) the admissible block \ which complements the tridiagonal block in its 
block column Cq. 

(c) (c) 

If a tridiagonal block Eg is empty, then the admissible block Nq occupies the entire block column Cq, 

(c) 

that is, this block column has rank at most p. If, on the contrary, a tridiagonal block Eg ' occupies the entire 
block column Cq, then only the tridiagonal blocks in the two neighboring block columns Cq-i mod k and 
C'g+i mod k Can be nonempty, and so all the other block columns are occupied entirely by admissible blocks 
and hence have ranks at most p. 


6.3 Separation of the tridiagonal and admissible blocks of a CV matrix 

The following lemma can be readily verified (cf. Figure 6). 


Lemma 6.2. O<x<0<77<(/>'<x'< 7r/2 and write r = exp((/)-\/^), c = exp{r]^/^), and r' = 
exp((j)' \/ —1). Then |c — t| = 2sin(2^) and the distance from the point c to the sector f(x, y') is equal to 
sin('0), for if = min {?7 - y, y' - v}- 


Next we specialize the block partition of the previous subsection to the case of a CV matrix Cgj of (1.1) 


for a fixed complex / such that |/| = 1. In this case tj = fujl for ujk = exp(27r-v/^/fc), j = 0,... ,n — 1, and 
every arc Aq contains \n/k~\ or [n/fcj knots tj. 

In Figure 7, if = 4>i + 


FIGURE 7 



Theorem 6.1. (Cf. Figure 7.) Assume a uniform k-partition of the knot sets of a CV matrix above 
for k > 12. Let F^ denote the union of the sector Fg and its two adjacent sectors on both sides, that is, 
Fg = Fg-i mod fc U Fg U Fg+i mod k- Write F^ to denote the exterior of the sector F^ and write Cq to denote 
the midpoints of the arcs Aq = A{(fq, (f'q) for cf'^ = (fg+i mod k and q = 0 ,... ,k — 1. Furthermore let Sq 
denote the distance from the center Cq to the sector f^. Then, for every q, (i) 5q > \ sin(^)| and (ii) the arc 
Aq and the sector are {9, Cq)-separated for 9 — 2sin(^)/sin(^). 
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6.2 


for X = (t) = 


9 ’ 


Proof. Suppose that 1 < g < fc — 3. Then T^ = r(^q_i, ^q_|_ 2 ). Apply Lemma 
c = Cq, (f>' = 4>'q = (j)q+i, and x' = 4 ‘q+2, obtaln the theorem. Similarly prove the theorem in the cases 
where g = 0, T^, = r(^fc_i, (/) 2 ); g = k-2 and = r( 0 fc- 3 ,^o), and g = fc-1 and = T{4>k-2, 4>i)- □ 


Recall that sin(?/) « y as y « 0, and therefore 0 « 1/3 provided that the integer k is large. Notice that 
for every g the admissible block is d efined by the knots tj lying on the arc Aq and the knots Si lying 
in the sector T^, and apply Corollary 


6.1 


For every g, g = 0,..., /c — 1, write 5q = min 






then 


notice that 6q > Sq, and obtain the following result. 


Corollary 6.2. Assume a sufficiently large integer k, 2k < n, and let a uniform k-partition of the knot sets 
S and T of an m X n CV matrix C define k admissible blocks iVg'^\ ... Then all of them have the 

\E\-ranks at most p, that is, C is an extended {\E\, p)-neutered matrix, where \E\ and p satisfy bound (6.1) 
for 9 : 


1/3 and 6 = min^^g |(5,| > | sin(^)|. 


Our fc-uniform partition of the complex plane into k congruent sectors defines a desired partition of CV 
matrix into {9, Cg)-separated blocks for 0 « 1/3 or smaller. Trying to extend our results to the more general 
class of Cauchy matrices Cs,t whose all knots tj lie on the unit circle {z : \z\ = 1}, one may consider various 
other partitions of the complex plane and apply the following extension of Lemma |6. 2 1 and Theorem |6.1| 


Lemma 6.3. Assume the numbers 9, <p, (j)', and c such that Q<9<\,Q<<fi<4)'< 2 tt, and c = exp(O.5(0'+ 
is the midpoint of the arc A{cj), 4>). Write r = r{(j),cj)',9) = | sin( Let D{c,r) = {z \ \z — c\ < 
r} denote the disc on the complex plane with a center c and a radius r and let D{c,r) = {z : \z — c\ > r} 
denotes the exterior of this disc. Then the two sets A{4),4>') and D{c,r) are {9, c)-separated. 


6.4 Approximation of a CV matrix by a balanced p-HSS matrix and the com¬ 
plexity of approximate computations with CV matrices 


Let 6 ^^'^ denote the minimum distance from the centers Cq to the knots st lying in the admissible blocks 
after the Mh recursive merging. Recall that th e an gles 2TT/k of the k congruent sectors ro,...,r/c-i are 
recursively doubled in every merging. So Lemma 
h = 1, 


6.2 


implies that > sin(37r2^/fc) after the hth merging, 
1. We define the recursive merging by choosing the integers k = 2*+ and I < l+. Choose them 
such that fc/2* = 2 ’‘+~^ > 6. Then > S- = sin(^) for all h, and so i5_ « ^ for large 

integers k. Together with Corollary |6.2| these relationships imply the following result. 


Theorem 6.2. The CV matrix C of Corollary \ 6.^ as well as its transpose CV^ matrix are two ex¬ 
tended balanced {f,,p)-HSS matrices where the values f and p are linked by bound (6.1\ l for \E\ = 

9 = 2sin(^)/sin(^), and S = Sh > S- = sin(^), so that 9 « 1/3 and S- « for large integers 
k. 


Combine Corollary 5.1 with this theorem applied for k = 2^+ of order n/log(n), for p and log(l/5) of 
order log(n), and for I < such that Ij. — I > 6 (verify that in this case the assumptions of the corollary 
are satisfied), and obtain the following complexity estimates for CV matrices C and CV^ matrices 


Theorem 6.3. Assume an m x n CV matrix C and a positive ^ such that log(l/^) = 0(log(n)). Then 
Ci^{C) = 0{{m + n) log^(n)). If in addition m = n and if the matrix C is f-approximated by a hierarchically 
regular extended balanced p-HSS matrix, then I3^{C) = 0(nlog^(n)). The same bounds hold for the CW" 
matrix C'^ replacing C. 


7 Extensions and implementation 

7.1 Computations with matrices having displacement structure, polynomials, 
and rational functions 


By combining the algebraic techniques of transformation of matrix structure of 


techniques, |P151 Section 9] extends the complexity bounds of Theorems 6.3 and 


P90| with the FMM/HSS 


7.1 to generalized Cauchy 
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matrices M = (/(si — for various functions f{z) such as z~p for a positive integer p, Inz, and 

tan z, to n X n structured matrices M having the displacement structures of Toeplitz, Hankel, Cauchy and 
Vandermonde types (cf. also m). and in particular to Cauchy matrices M = Cs.t having arbitrary sets of 
knots S and T. In the latter case the approximation error bound ^ increases by a factor bounded from above 
by the condition number k{M) = ||M|| ||M+||, and the results are readily extended to Problems 3 and 4 
of multipoint rational evaluation and interpolation. Next we specify the simpler extension to computations 
with a Vandermonde matrix, its transpose, and polynomials. 


Theorem 7.1. For a positive ^ and a vector s = write V = Vs and s+ = max™g^ |si|. 

(i) Then = 0{{m + n)/5log^(n)) provided that s_|_ is bounded from above by a constant. 

(a) Suppose that, for m = n and some complex f, \f \ = 1, the CV matrix Csj has been approximated by 
a hierarchically nonsingular extended balanced {f,p)-HSS matrix. Then = 0{np^\og{n)). 

(Hi) One can extend the above bounds on Q!{(V) and fd^iV) to the solution of Problems 1 and 2 of Section^ 


Combine Theorem 6.3 equations rt3.2 


Proo f. With no loss of generality we can assume that m = n. 

(3.4) and their transposes. The matrices diag(w'^)"rQ, /^/n, and diag(/'^)"rQ are unitary, and so 

multiplication by them and by their inverses makes no impact on the output error norms. Multiplication 
by the matrix diag(s” — can increase the value f by at most a factor of 1 + s" < 1 + |I4|s-|-, while 

multiplication by the inverse of this matrix increases ^ by a factor of A = 1/max^. |j|=i min"CQ^ |s" — /"|, 
which is at most 2n for a proper choice of the value / such that |/| = 1. Then the increase by a factor of A 
would make no impact on the asymptotic bounds of Theorem [m and so we complete the proof of parts (i) 
and (ii). Equations of Problem 1 extend the proof to part (iii). □ 


7.2 Simplified implementation 

One can implement our algorithms by computing the centers Cq and the admissible blocks Nq of bounded 
ranks in the merging process, but can avoid a large part of the computations by following the recipe of the 
papers |CGS07| . |X12| . |XXG12] . and [XXGB14] . The idea is to bypass the computation of the centers Cq 
and immediately compute HSS generators for the admissible blocks Nq, defined by HSS trees. The length 
(size) of the generators at every merging stage (represented by a fixed level of the tree) can be chosen equal 
to the available upper bound on the numerical ranks of these blocks or can be adapted empirically. See 
[PLSZal Section 10.1] for a recent acceleration of this stage. 

PART IV: NUMERIGAL TESTS AND GONGLUSIONS 


8 Numerical Experiments 

Numerical experiments have been performed under our supervision in the Graduate Genter of the Gity 
University of New York by Franklin Lee and Aron Wolinetz (Section |8.1[ ) and by Liang Zhao (Section |8.2[ ). 
All computations have been performed with the IEEE standard double precision. The codes are available 
upon request. 

8.1 Experimental computation of numerical ranks of the admissible blocks of 
CV matrices 

The test programs were written in Python 3.3.3, using the Numpy 1.7.1, Scipy 0.12.1, and Sympy 0.7.3 
libraries. The tests were run on Windows 7 64-bit SPl on a Toshiba Satellite L515-S4925 with a Pentium 
Dual-Gore T4300 @ 2.10GHz x2 processor. Random numbers were generated uniformly with the language’s 
Mersenne twister over the range {a: : 0 < a: < 1} and extended to the ranges {y : a < y < b) for 
y = a + (b — a)x. 

For n = 1024,2048,4096 we computed the vectors (wj)"LQ^ of the nth roots of unity, and for every pair 
of n and h, h = 0, 1,4, we generated 100,000 instances of complex numbers sq, ... , s„_i, thus defining n x n 
GV matrices Cs.i = 


19 















We generated the knots Si = |si| exp(0i-\/^) as follows. At first we generated the angles over the 
range 0 < < 27r and the values | over the range [1 — 1 /2^, 1 +1/2^) for /i = 0,1,4 and i = 0,..., n — 1, in 

all cases independently for all i and t. Then for every vector we computed the permutation matrix 

P defining the vector = P{^i)^ZQ with the coordinates (j)Q,..., 4>n-i in the nondecreasing order. For 

every pair of the vectors (|si|)”Tg^ and {(j)i)7=o defined the vector (si)"Tg^ = (jsij exp((^ii/^))"TQ^ and the 
CV matrix C = )[[7=o- Then we fixed the integers k = 4,32,512,2048, skipped integer pairs (fc,n) 

where k < 2 or n/k < 2, and defined tridiagonal and admissible blocks by following the recipes of Section 

Finally we fixed the tolerances ^ = 10“'^ for g = 2, 3,4 and computed the ^-ranks of nonempty admissible 
blocks by applying the rank function numpy.linalg.matrix_rank(Ai, toZ). 

Tables |8.1H8.3| show the average computed values of the ^-ranks in these tests. They vary rather little, 
remaining consistently small, when we changed the parameters h, k, and and they grew very slowly when 
we doubled the matrix dimension n. 

We also computed the average norms of the admissible blocks. They ranged between 100 and 1000. 


Table 8.1: The ^-ranks of the admissible blocks for h = 0 


e 

n 

k=4 

k=32 

k=512 

0.01 

1024 

5.0 

5.0 

2.0 

0.01 

2048 

5.0 

5.0 

3.0 

0.01 

4096 

5.0 

5.0 

3.8 

0.001 

1024 

6.0 

6.0 

2.0 

0.001 

2048 

6.0 

6.0 

3.8 

0.001 

4096 

6.0 

6.3 

4.3 

0.0001 

1024 

7.0 

7.0 

2.0 

0.0001 

2048 

7.0 

7.0 

4.0 

0.0001 

4096 

7.0 

7.8 

5.0 


Table 8.2: The ^-ranks of the admissible blocks for h = 1 


e 

n 

k=4 

k=32 

k=512 

0.01 

1024 

4.0 

5.0 

2.0 

0.01 

2048 

4.0 

5.0 

3.4 

0.01 

4096 

5.0 

5.8 

4.0 

0.001 

1024 

5.0 

6.0 

2.0 

0.001 

2048 

5.0 

6.0 

4.0 

0.001 

4096 

6.0 

7.0 

4.8 

0.0001 

1024 

6.0 

7.0 

2.0 

0.0001 

2048 

6.0 

7.0 

4.0 

0.0001 

4096 

6.0 

8.0 

5.4 


8.2 Multipoint numerical evaluation of polynomials 

We tested numerical behavior of our algorithms for approximate evaluation of real and complex Gaussian 
random polynomials p{x) of degree n — 1, for n = 64,128, 256, 512,1024, 2048,4096, and generated the knots 
of the evaluation lying in the unit disc {z : jzj < 1}. 

We performed the tests on a Dell server running Windows system and using MATLAB R2014a with 
double precision. We applied the MATLAB function ”randn()” in order to generate the real polynomial 
coefficients and the real and imaginary parts separately for the complex coefficients. 
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Table 8.3: The ^-ranks of the admissible blocks for h = A 


e 

n 

k=4 

k=32 

k=512 

0.01 

1024 

4.0 

5.0 

2.0 

0.01 

2048 

4.0 

5.0 

4.0 

0.01 

4096 

4.0 

5.0 

5.0 

0.001 

1024 

4.0 

6.0 

2.0 

0.001 

2048 

5.0 

6.0 

4.0 

0.001 

4096 

5.0 

6.0 

5.9 

0.0001 

1024 

5.0 

7.0 

2.0 

0.0001 

2048 

5.0 

7.0 

4.0 

0.0001 

4096 

6.0 

7.0 

6.6 


The knots of the evaluation, Si = r x exp(27r0-\/—1), depended on two parameters r and 9. In all tests we 
defined the values 6 by applying the uniform random number generator ”rand()” to the line interval [0,1), 
and we generated the absolute values r in two ways. 

In one series of our tests we set the absolute value r to 1, thus placing the knots Si onto the unit circle 
{x : l^l = 1}, and then we displayed the test results in Tables 8.4 and 8.6 


In another series of our tests we generated the absolute value r at random by applying the uniform 
random number generator ”rand()” to the line interval [0,1], and then we displayed the test results in Tables 
8.5 and 8.7 The latter tests cover polynomial evaluation at the knots lying in the unit disc {z : \z\ < 1}, but 


can be extended to the evaluation outside it, by shifting from a polynomial p{x) of degree n to the reverse 
polynomial x^p{l/x). 

In all tables the columns “Max. Rank“ represent the maximum ^-ranks of the off-tridiagonal blocks in 
the computation, for ^ = 10“®. The columns “Error” represent the absolute difference of our computed 
values of the polynomials and the output of the MATLAB function ”polyval()” for the same inputs. 

All tests have been repeated 100 times for each n and the average results have been displayed. 

According to the test results, the computed maximum numerical rank was consistently low, implying 
that our algorithm ran fast, even though it still produced quite accurate output values. 

For comparison. Table |8^ displays the mean values and standard deviations of the output errors observed 
in our test of the polynomial evaluation algorithm of [MB72] applied to the same inputs and also with the 
IEEE standard double precision. According to these results, the algorithm has consistently performed with 
much inferior output accuracy for polynomials of degree 32 and higher. 


Table 8.4: Evaluation of Real Gaussian Polynomials on the Unit Circle 


Degree 

Max. Rank 

Error 

32 

13 

6.60 X lO"*^^ 

64 

11 

8.05 X lO"*^*^ 

128 

12 

5.88 X lO"*^^ 

256 

12 

4.01 X lO"*^^ 

512 

12 

2.27 X lO"*^^ 

1024 

12 

5.77 X 10-*^*^ 

2048 

13 

1.38 X 10"°® 

4096 

13 

2.99 X lO"'^'^ 
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Table 8.5: Evaluation of Real Gaussian Polynomial in the Unit Disk 


Degree 

Max. Rank 

Error 

32 

18 

1.90 X 10"*^“ 

64 

13 

1.47 X 10"*^“ 

128 

13 

1.13 X 10-*^“ 

256 

12 

9.09 X lO"*^^ 

512 

13 

7.05 X lO"*^^ 

1024 

12 

5.49 X lO"*^^ 

2048 

13 

4.67 X lO"*^^ 

4096 

13 

3.80 X 10"°^ 


Table 8.6: Evaluation of Complex Gaussian Polynomials on the Unit Circle 


Degree 

Max. Rank 

Error 

32 

12 

5.68 X lO"*^*^ 

64 

11 

5.05 X lO"*^^ 

128 

12 

1.41 X lO"*^^ 

256 

11 

1.42 X lO"*^^ 

512 

12 

2.73 X lO"*^^ 

1024 

12 

5.34 X lO"*^*^ 

2048 

13 

5.18 X 10"°® 

4096 

13 

1.62 X lO-'^^ 


Table 8.7: Evaluation of Complex Gaussian Polynomial in the Unit Disk 


Degree 

Max. Rank 

Error 

32 

18 

1.77 X 10-*^“ 

64 

13 

1.39 X 10-*^“ 

128 

13 

1.16 X 10-*^“ 

256 

12 

8.71 X 10"°^ 

512 

12 

6.97 X lO"*^^ 

1024 

12 

5.40 X 10"°^ 

2048 

13 

4.73 X 10"“^ 

4096 

13 

3.86 X 10"°^ 


Table 8.8: Polynomial Evaluation by Using the Algorithm of |MB72j 
(the entry “Inf’ means “beyond the range”) 



Real Gaussian 

Complex Gaussian 

Degree 

mean 

std 

mean 

std 

16 

5.19 X lO-'^y 

1.21 X 10"^*^ 

8.91 X 10"“ 

6.50 X 10"“ 

32 

4.54 X lO"*^^ 

6.72 X 10"^^ 

1.66 X lO"*^^ 

8.86 X lO-*^^ 

64 

9.47 X 10+^1 

2.99 X 10+^^ 

2.96 X 10+11 

1.22 X 10+11 

128 

2.87 X 10+'^^ 

7.21 X 10+^^ 

2.12 X lO+i*^^ 

Inf 
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9 Conclusions 


The papers [MRT05] . [CGS07| . [XXG12] . and [XXCB14] combine the FMM/HSS techniques with the trans¬ 
formation of matrix structures (traced back to [P90j ) in order to devise fast algorithms that approximate 
the solution of Toeplitz, Hankel, Toeplitz-like, and Hankel-like linear systems of equations by using nearly 
linear number of arithmetic operations performed with bounded precision. We yielded similar results (that 
is, used nearly linear number of arithmetic operations performed with bounded precision) for multiplication 
of Vandermonde and Cauchy matrices by a vector, the solution of linear systems of equations with these 
matrices, and polynomial multipoint evaluation and interpolation. This can be compared with quadratic 
arithmetic time of the known algorithms. The more involved techniques of 2D FMM should help to decrease 
our upper bounds a{M) by a logarithmic factor (cf. |B10[ Section 3.6]). 

Our Section 7.1 and the papers |P15| and |P16] cover some extensions of our techniques and results to 
computations with other structured matrices and rational functions. Our study also promises a natural 
extension to the important class of polynomial Vandermonde matrices, l^.g = (Pj(a;i))™Tg"~^, where P = 
{Pj{x))^~Q is any basis in the space of polynomials of degree less than n. This extension should exploit the 
following generalization of our equation (3.11, which reproduces [POll equation (3.6.8)], 


n—1 

Cg.t = diag(Z(si)-i)”loVp,sVp t diag{l', l{x) = H 

3=0 

For a natural further direction, we plan to recast our algorithms into the form of algorithms for com¬ 
putations with H and H2 matrices. This will enable us to apply the efficient subroutines available in the 
HLib library developed at the Max Planck Institute for Mathematics in the Sciences by L. Grasedyck and 
S. Borm, www.hlib.org, and in the H2Lib, http://www.h21ib.org/, https://github.com/H2Lib/H2Lib. 

Acknowledgements: Our research has been supported by the NSF Grants CGF 1116736 and GGF- 
1563942 and PSC CUNY Awards 67699-00 45 and 68862-00 46. We also greatly appreciate reviewers’ 
thoughtful and helpful comments. 
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